We conducted a genome-wide association study to search for risk alleles associated with Tetralogy of Fallot 
(TOF), using a northern European discovery set of 835 cases and 5159 controls. A region on chromosome 
12q24 was associated (P = 1.4 x 10"^) and replicated convincingly (P = 3.9 x 10"^) in 798 cases and 2931 
controls [per allele odds ratio (OR) = 1.27 in replication cohort, P= 7.7 x 10~^^ in combined populations]. 
Single nucleotide polymorphisms in the glypican 5 gene on chromosome 13q32 were also associated 
(P= 1.7 X 10"^) and replicated convincingly (P= 1.2 x 10"^) in 789 cases and 2927 controls (per allele 
OR = 1.31 in replication cohort, P= 3.03 x 10~^^ in combined populations). Four additional regions on chro- 
mosomes 10, 15 and 16 showed suggestive association accompanied by nominal replication. This study, the 
first genome-wide association study of a congenital heart malformation phenotype, provides evidence that 
common genetic variation influences the risk of TOF. 
INTRODUCTION

Congenital heart disease (CHD) affects ~1% of live births and
is a major source of morbidity and mortality in childhood.
Approximately 20% of CHD occurs in the setting of chromosomal
conditions or multisystem malformation syndromes. Family
studies in the remaining 80% of 'sporadic' cases indicate a
significant complex genetic component to the disease (1). Previous
studies have indicated that rare and de novo copy number
variation in the human genome contributes 5-10% of the population
risk of sporadic CHD (2-5), but genome-wide association
studies (GWAS) assessing the relationship between common
single nucleotide polymorphisms (SNPs) and CHD risk are yet
to be reported. Tetralogy of Fallot (TOF) is the commonest
form of cyanotic CHD, affecting ~3 per 10 000 newborns (6).
Although TOF is usually repaired in infancy with low mortality,
there is substantial late morbidity, in particular from pulmonary
valvular insufficiency and atrial and ventricular arrhythmias.
Population studies suggest a substantial familial recurrence
risk in sporadic, non-syndromic TOF (1,7). We, therefore,
undertook a GWAS to identify common genetic risk factors
predisposing to TOF, given the high success of this design over the
past 5 years (8).

RESULTS

Figure 1 shows the genome-wide association results obtained
with two complementary approaches: case/control analysis in
PLINK and family-based analysis in the estimation of maternal,
imprinting and interaction effects software (EMIM) (see
Materials and Methods). With PLINK, the strongest signal of
association occurred at a group of SNPs on chromosome 12q24
(top SNP rs11065987, P = 4.6 x 10"**), with several other
regions, including SNPs on chromosomes 3, 10, 13 and 16,
reaching more modest, but suggestive, levels of significance
(P < 1 X 10~^). Quantile-quantile (QQ) plots (Supplementary
Material, Fig. S1A) indicated a slight inflation in the
genome-wide distribution of test statistics (genomic control
inflation factor λ = 1.076) (9), possibly due to unmodelled
population substructure. Correction using the top 10 principal
components from EIGENSOFT reduced the genomic control
inflation factor slightly (λ = 1.037, Supplementary Material,
Fig. S1B) without affecting the overall results substantially
(rs11065987 remained the top SNP, P = 6.0 x 10"**). Correction
using GenABEL resulted in a reduction in the genomic
control factor λ to 0.996, with no systematic departure from
the expected line in the resulting QQ plot (Supplementary
Material, Fig. S1C). The 12q24 region remained the strongest
signal of association (Table 1, Supplementary Material,
Table S1) with a slight decrease in overall significance
(P = 1.4 X 10"'' at the top SNP, rs11065987).

Figure 1. Manhattan plots of the genome-wide association results. The top panel (PLINK results) shows the -log10 P-values from the Cochran-Armitage trend 
test (for autosomal SNPs) or logistic regression allowing for gender as a covariate (for X-chromosomal SNPs). The bottom panel (EMIM results) shows the 
-log10 P-values from the child trend test (autosomal SNPs only). Dashed lines are shown at significance thresholds for suggestive (P = 1 x 10 ^) and 
significant (P = 5 X 10 ^) association, respectively. 
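The genomic control inflation factor λ quoted for the QQ plots is conventionally the ratio of the observed median test statistic to the median expected under the null. The following is a minimal sketch of that calculation, assuming every test is a 1-degree-of-freedom chi-square test; the function names and the evenly spaced P-values are illustrative and are not taken from the study.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def chi2_from_p(p):
    """Convert a two-sided P-value into the matching 1-df chi-square statistic."""
    # Using the lower tail (p / 2) avoids the precision loss of 1 - p / 2.
    z = NormalDist().inv_cdf(p / 2.0)
    return z * z


def genomic_control_lambda(p_values):
    """Observed median chi-square statistic divided by its null expectation."""
    stats = [chi2_from_p(p) for p in p_values]
    return median(stats) / CHI2_1DF_MEDIAN


# Hypothetical, evenly spaced P-values standing in for a scan with no inflation.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
lam = genomic_control_lambda(null_p)
print(f"lambda = {lam:.3f}")
```

Values of λ close to 1 indicate little inflation; values above 1, such as the uncorrected estimate reported here, are commonly read as a sign of residual population structure or relatedness.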

The results from EMIM were found to be broadly concord- 
ant with those from PLINK (Supplementary Material, Fig. S2), 
although this fact is visually less obvious in Figure 1 due to 
slight differences in the precise levels of significance achieved 
at the top ranking SNPs. In EMIM, the strongest signal of
association was seen on chromosome 13q32 (rs7982677, P =
7.4 X 10~^, reducing to P = 1.7 x 10~ with correction for
the observed genomic control inflation factor of 1.057).
Summary-level results of the GWAS in the discovery cohort 
are available at http://www.staff.ncl.ac.uk/heather.cordell/ 
TOFGWAS.html. 

We attempted to replicate all association signals passing
nominal significance P < 1 x 10~^ (Table 1, Supplementary
Material, Table S1) using an additional cohort of UK,
Dutch and Canadian TOF cases and controls of northern
European ancestry. The chromosome 12q24 region strongly
replicated (P = 9.1 x 10"*^ at rs233722; P = 3.9 x 10"^ at
rs11065987; combined discovery and replication P = 1.1 x
10"" at rs11065987) with a per allele odds ratio (OR)
for TOF of 1.27 [95% CI (1.13-1.42)] in the replication
cohort. The chromosome 13q32 region also strongly replicated
(P = 1.2 x 10~^ at rs7982677, combined discovery and
replication P = 3.03 x 10"") with a per allele OR for TOF
of 1.31 [95% CI (1.16-1.48)] in the replication cohort.
More modest levels of replication were seen at two separate
regions of chromosome 10 (P = 0.0018 at rs2388896, P =
0.0062 at rs2228638) and on chromosomes 15 (P = 0.043 at
rs12593223) and 16 (P = 0.033 at rs6499100).
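A per allele OR and its 95% CI together imply an approximate test statistic, which gives a quick consistency check on reported results. The sketch below assumes a Wald test with a CI that is symmetric on the log-odds scale; the function name is illustrative. Because the inputs are the rounded values printed above, the output is only approximate.

```python
import math


def wald_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Recover the standard error, z statistic and two-sided P-value
    implied by an odds ratio and its 95% confidence interval."""
    # The CI spans 2 * z_crit standard errors on the log-odds scale.
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = math.log(odds_ratio) / se
    # Two-sided tail probability of a standard normal deviate.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return se, z, p


# Replication estimates quoted in the text for 12q24 and 13q32.
for label, estimate in [("12q24", (1.27, 1.13, 1.42)),
                        ("13q32", (1.31, 1.16, 1.48))]:
    se, z, p = wald_from_or_ci(*estimate)
    print(f"{label}: SE = {se:.4f}, z = {z:.2f}, P = {p:.1e}")
```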

Figure 2 shows LocusZoom plots (10) from the discovery 
analysis for the two most strongly replicating loci, including 
colour codings of the linkage disequilibrium (LD) values 
between the top scoring SNPs. We used stepwise logistic re- 
gression to determine the extent to which the association 
signal in the chromosome 12q24 region (which was supported 
by a large number of SNPs spanning ~1.4 Mb) could be
accounted for by the top SNP. In the discovery cohort,
inclusion of rs11065987 as a covariate reduced the signal of
association to P > 10~^ in the vicinity (Supplementary Material,
Fig. S3), suggesting that LD with rs11065987 could account
for most of the strong associations seen at SNPs in this
region. In the replication cohort, only two SNPs (rs233722
and rs233716) retained nominal significance (P < 0.05) once
rs11065987 had been included as a covariate, whereas only
one SNP (rs11065987) retained nominal significance once
rs233722 had been included as a covariate (Supplementary
Material, Table S2), suggesting again that the association 
signal could largely be accounted for by a single variant 
within the LD block. 

To further refine the association signal at chromosome
12q24, we carried out imputation in the discovery cohort
within the 5 Mb region centred around rs11065987. Although
the association signal was supported by results from a number
of imputed SNPs (Supplementary Material, Fig. S4), none
showed significantly stronger association than had already
been seen at our top genotyped SNP, rs11065987.

Figure 2. LocusZoom plots for the replicating loci on chromosomes 12 (PLINK results) and 13 (EMIM results). The colour coding indicates the degree of LD of 
each SNP with the named index SNP (shown as a purple diamond), as estimated from the HapMap CEU population. 

DISCUSSION 

Our study provides compelling evidence that common genetic 
variation influences the risk of CHD. The top SNPs on chromo- 
some 12 implicated by our discovery and replication analyses 
(P= 7.7 X 10"") are located within a large LD block on 
12q24. This region has previously been associated with a 
variety of complex genetic conditions, including type 1 diabetes 
(11), celiac disease (12), coronary artery disease (13),
rheumatoid arthritis (14), systemic lupus erythematosus (15) and
multiple sclerosis (16), and with blood platelet count (13). The risk
alleles at the top SNPs in our analysis all lie on a common 
(40% frequency in individuals of White European ancestry) 
1.6 Mb haplotype at chromosome 12q24, containing 15 genes, 
that spans our associated interval (Fig. 2 and Supplementary 
Material, Table S3). A previous analysis (13) provided strong 
evidence that the haplotype has undergone recent positive selec- 
tion in individuals of European ancestry. For all conditions asso- 
ciated with this haplotype described thus far, including TOF, the 
alleles conferring increased risk at the associated SNPs are the 
derived alleles, whose frequencies have increased in Europeans 
by a selective sweep (starting around 3000 years ago). Our 
chromosome 12 results add to the remarkable disease pleiotropy 
associated with variation in the 12q24 region of the human 
genome. 

Imputation analysis in our discovery cohort indicated that,
although our association signal could be driven by any one
of a number of imputed SNPs, none show significantly stronger
association than is already seen at our top genotyped SNP
(rs11065987). Further interrogation of this region, ideally in
non-European cohorts showing varying LD patterns, will be
required to further refine the association signal. Within the
12q24 region, the strongest candidate gene is tyrosine-protein
phosphatase non-receptor type 11 (PTPN11), a regulator of
Ras/mitogen-activated protein kinase signalling. Activating
mutations of PTPN11 are a cause of Noonan's syndrome
(17), a Mendelian multisystem disorder in which malformation
of the cardiac outflow tract is a typical feature. Association
between a SNP in PTPN11 (rs11066320) and TOF was previously
demonstrated (18) in a candidate gene study using a
family-based association analysis approach on a set of 754
TOF cases that partially overlap with the set of cases described
here. Further research will be necessary to investigate the
hypothesis that the SNP association we have observed in this
region results from upregulation of the activity of PTPN11,
and the magnitude of effect compared with that of the
gain-of-function mutations that lead to Noonan's syndrome.

The observation of association between TOF and an evolutionarily
selected haplotype at 12q24 raises some intriguing
questions. In contrast with conditions of adult onset, such as
coronary artery disease and rheumatoid arthritis, previously shown
to be associated with 12q24 SNPs, the condition on which we
have focussed carries substantial early mortality that significantly
impacts reproductive fitness. Prior to the modern
cardiac surgical era, mortality from TOF before the age of 10
years was around 80%. Given this, it might be expected that
any variant conferring even a modest additional risk of TOF
would have been eliminated, or driven to very low allele
frequency, by natural selection. The nature of the selection event
responsible for the emergence of the pleiotropic 12q24 risk
haplotype is unknown, although given the role of genes in the
region in T-cell function, it is thought likely to relate to
enhanced resistance to infectious disease at a time of increased
human population density in Europe (13). Our data on a
childhood phenotype that until recently was highly lethal
illustrate the impact of immunity as a selective force in
human evolution, as has been recently shown for the effects
of Y chromosome haplogroups on immune function and
atherosclerosis risk (19).

Our findings on chromosome 13 [rs7982677 at 13q32, OR =
1.31, 95% CI = (1.16-1.48), P = 3.03 x 10""] and in two
separate regions of chromosome 10 [rs2388896 at 10p14,
OR = 0.83, 95% CI = (0.74, 0.93), P = 8.55 x 10"** and
rs2228638 at 10p11.2, OR = 1.28, 95% CI = (1.07, 1.53),
P = 2.05 X 10"^] are also significant. The top SNPs in the
13q32 region lie in intron 7 of the GPC5 gene, which encodes
glypican 5. Glypicans are heparan sulphate proteoglycans
that are bound to the outer surface of the plasma membrane
by a glycosyl-phosphatidylinositol anchor. There are six
members of the glypican family in mammals, and they serve
as regulators of key developmental signalling pathways, including
the Wnt, Hedgehog, fibroblast growth factor and bone
morphogenetic protein pathways (20). Glypicans are, therefore,
strong candidate genes for involvement in cellular growth
control and morphogenesis during heart development. GPC5
has eight exons and occupies ~1.42 Mb at chromosome
13q32. The seventh intron, which harbours our top SNPs, is
735 kb long and contains three putative non-coding transcripts,
none of which are overlapped by our top SNPs. SNPs in GPC5
have been associated with the risk of nephrotic syndrome (21),
with protection from sudden cardiac arrest (22) and with higher
risks of lung cancer and multiple sclerosis (23,24). GPC5 is
among the genes deleted in a subset of patients with 13q deletion
syndrome, a multisystem disorder (typically characterized by
holoprosencephaly) in which CHD can occur. In a French
cohort of 12 patients with 13q deletion, 2 had TOF, and
GPC5 was deleted in both of these patients (25); single patients
with 13q deletion and TOF have been reported by two other
groups (26,27). Defects in genes encoding other members
of the glypican protein family have been linked to cardiac
malformation syndromes. CHD is common in Simpson-Golabi-Behmel
syndrome, the condition associated with GPC3 mutation
(28), and has been reported in omodysplasia, the condition
associated with GPC6 mutation (29). Given the large size of the
genomic interval spanned by the GPC5 gene, it seems likely that
the association we have observed here results from an effect on
the function of GPC5 itself, rather than a neighbouring gene.

On chromosome 10, rs2388896 lies in a gene desert [the 
nearest gene, GATA-binding transcription factor protein 
family, lies some 0.8 Mb distant], whereas rs2228638 is a
non-synonymous coding SNP that results in the substitution of
isoleucine for valine at position 733 of the neuropilin-1 protein,
which is encoded by the gene NRP1. Neuropilin-1 is a receptor
for ligands from the vascular endothelial growth factor and 
semaphorin families that is essential for septation of the 
cardiac outflow tract (30). Neuropilin-1 has an important func- 
tion in normal arteriovenous patterning (31), and disruption of 
endothelial neuropilin-1 leads to cardiac outflow tract defects 
similar to those seen in human DiGeorge syndrome (in which 
context TOF is a frequent occurrence) (32). Further research 
will be required to identify whether the association is due to a
functional effect of rs2228638 or of some other variant in LD. 

In conclusion, our study has identified two regions (12q24 
and 13q32) that are strongly and replicably associated with 
TOF. As far as we are aware, this is the first genome-wide 
study focussing on SNP associations to have been reported 
in CHD. Pooling patient resources from several international 
groups was required to achieve sufficient numbers of cases 
for adequate power to detect and replicate a number of 
genetic effects; however, studies of commoner diseases have 
emphasized the importance of very large cohorts of cases 
and controls. The present relatively small GWAS cohort has 
in all likelihood detected only the strongest genome-wide 
influences on TOF risk. Our results suggest that larger inter- 
national collaborative studies have the potential to discover 
additional significantly associated loci. 

Complete summary association results from the GWAS 
component of our study (the EIGENSOFT corrected results) 
are available to interested researchers by contacting the corre- 
sponding author. 



MATERIALS AND METHODS 

Study subjects and genotyping 

For the initial (discovery) phase, individuals with TOF of northern
European ancestry, together with their parents and affected
siblings (when available), were recruited from multiple centres
in Newcastle, Leeds, Bristol, Liverpool, Oxford, Nottingham
and Leicester (all UK), Leuven (Belgium), Erlangen
(Germany) and Sydney (Australia). Ethical approval was
obtained from the local institutional review boards at each of
the participating centres prior to blood or saliva sample
collections, and informed consent was obtained from all subjects, or
from their parents/legal guardians, if the patient was a child too
young to provide consent him/herself. Patients who exhibited
clinical features of recognized malformation syndromes,
developmental abnormalities or learning difficulties were excluded
from the study. All samples were screened for the known
22q11.2 deletion associated with TOF (33,34) using multiplex
ligation-dependent probe amplification (MRC-Holland) and
excluded if the deletion was present. SNP genotyping was
carried out at the Centre National de Genotypage (Evry Cedex,
France) using the Illumina 660W-Quad array, and the genotypes
were compared with genotype data for UK population-based
controls (5667 individuals genotyped on the Illumina 1.2M chip)
obtained from the Wellcome Trust Case Control Consortium 2
(WTCCC2) (https://ccc.wtccc.org.uk/ccc2).

For the replication phase, individuals with TOF of self- 
reported Caucasian ancestry were recruited from Oxford, Not- 
tingham, Newcastle, The Netherlands and Canada. Five 
hundred and fifteen Dutch TOF cases were identified from 
the Netherlands national registry and DNA bank of patients 
with CHD (CONCOR), the design of which has been previ- 
ously described (35). One hundred and forty-four Canadian 
TOF cases, together with Canadian controls of self-reported 
Caucasian ancestry, were obtained from the SickKids Heart 
Centre Biobank, an Ontario province-wide biorepository for 
CHD. These replication samples were genotyped at the
Centre National de Genotypage using Sequenom
matrix-assisted laser desorption/ionization time of flight. Additional
UK population-based control data for the replication were
obtained from the TwinsUK resource (http://www.twinsuk.ac.uk),
an adult twin registry comprising 12 000 (predominantly
female) twins. Genotype data for 3512 twin individuals
(genotyped using the Illumina 610K array) were obtained from the
Department of Twin Research and Genetic Epidemiology at
King's College, London. Only the first twin from each pair
of genotyped twins (2603 unrelated individuals) was used in
the current study. 

Statistical analysis 

Following stringent quality control (QC) procedures (see 
below), the final discovery dataset for analysis consisted of 
835 unrelated cases and 5159 controls (plus 717 additional 
family members, including both parents for 293 of the 
cases), genotyped at 516 131 autosomal and X-chromosomal 
SNPs. The primary analysis performed was a case/control analysis
of the 835 unrelated cases and 5159 controls using either
the Cochran-Armitage trend test (for autosomal SNPs) or logistic
regression, allowing for gender as a covariate (for
X-chromosomal SNPs), using the software PLINK (36). We
also used two alternative approaches designed to correct for 
possible population stratification: Firstly, we used the 
'smartpca' routine of the EIGENSOFT package (37) to calcu- 
late the top 10 principal components that were entered as cov- 
ariates into a logistic regression analysis of each SNP across 
the genome. Secondly, we analysed each SNP using a score 
test from a linear mixed model (38), allowing for possible re- 
latedness between individuals via consideration of their 
genome-wide estimated kinship coefficients, using the 
'mmscore' function in the R package GenABEL (39). This ap- 
proach has been previously proposed as a means of correcting 
for unknown population structure between apparently unre- 
lated individuals in a genome-wide association study (40). 
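The Cochran-Armitage trend test used for the autosomal SNPs can be written in closed form from the six genotype counts. The sketch below assumes additive weights (0, 1, 2) for the number of copies of the tested allele and a 1-degree-of-freedom chi-square reference distribution. It is an illustration of the textbook statistic rather than PLINK's own code, and the genotype counts are hypothetical.

```python
import math


def cochran_armitage_trend(case_counts, control_counts, weights=(0, 1, 2)):
    """Return (chi-square, P-value) for the 1-df Cochran-Armitage trend test.

    case_counts and control_counts hold genotype counts ordered by the
    number of copies of the tested allele (0, 1, 2).
    """
    n_cases = sum(case_counts)
    n_controls = sum(control_counts)
    n_total = n_cases + n_controls
    genotype_totals = [r + s for r, s in zip(case_counts, control_counts)]

    sum_w_cases = sum(w * r for w, r in zip(weights, case_counts))
    sum_w_all = sum(w * n for w, n in zip(weights, genotype_totals))
    sum_w2_all = sum(w * w * n for w, n in zip(weights, genotype_totals))

    numerator = n_total * (n_total * sum_w_cases - n_cases * sum_w_all) ** 2
    denominator = n_cases * n_controls * (n_total * sum_w2_all - sum_w_all ** 2)
    chi2 = numerator / denominator
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype counts, not taken from the study.
chi2, p_value = cochran_armitage_trend((10, 20, 30), (30, 20, 10))
print(f"chi2 = {chi2:.2f}, P = {p_value:.2e}")
```

Unlike an allelic chi-square test, the trend test works on genotype counts and so does not rely on Hardy-Weinberg equilibrium in the combined sample.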

In addition to the case/control analyses, we also performed a 
family-based association analysis (of autosomal SNPs only) of 
all cases, their relatives and controls using the software 
package EMIM (41). EMIM is primarily designed for the investigation
of complex genetic effects, including maternal
genotype effects, maternal-fetal interactions and imprinting
(41). In this instance, we used the 'child trend' model in 
EMIM to model only an effect of the child's (case's) own 
genotype. The resulting analysis was, thus, conceptually 
similar to the case/control analysis we had performed in 
PLINK, but with the advantage of allowing additional infor- 
mation from parental genotypes to be incorporated, where 
available. 

The replication cohort (after QC) consisted of 798 cases and
2931 controls. Genotypes were analysed using logistic regression
in PLINK, allowing for two levels of nationality (British/
Dutch or Canadian) as a covariate. P-values for association in
the combined (discovery and replication) datasets were calculated
using Fisher's trend approach for combining P-values as
implemented in the online software MetaP
(http://www.svaproject.org/metap.php).
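For orientation, the classical Fisher method combines k independent P-values through the statistic X = -2 Σ ln(p_i), which follows a chi-square distribution with 2k degrees of freedom under the null. The sketch below implements that classical form only; whether MetaP's 'Fisher's trend' option matches it exactly is an assumption, so the output should not be expected to reproduce the combined P-values reported here. The input P-values are hypothetical.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent P-values with Fisher's method.

    X = -2 * sum(ln p_i) is chi-square with 2k degrees of freedom under the
    null. For an even number of degrees of freedom the upper tail has the
    closed form exp(-X/2) * sum_{j=0}^{k-1} (X/2)^j / j!.
    """
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half_x / j
        total += term
    return math.exp(-half_x) * total


# Hypothetical discovery and replication P-values.
combined = fisher_combined_p([0.01, 0.02])
print(f"combined P = {combined:.6f}")
```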

To further refine the association signal at chromosome
12q24, we carried out imputation in the discovery cohort
within the 5 Mb region centred around rs11065987. We used
the program IMPUTE version 2 (42) with data from the 1000
Genomes Project (Phase I interim data, released June 2011) as
a reference panel. Data at 11 208 SNPs passing
post-imputation QC (from an original 56 637 imputed SNPs)
were analysed using the program SNPTEST to test for
association with disease status.

Human Molecular Genetics, 2013, Vol. 22, No. 7 1479

QC procedures 

Discovery cohort 

We used stringent QC checks to ensure that only high-quality 
data were included in the final analysis. QC procedures were 
carried out in PLINK version 1.07 (36) with visualization per- 
formed in R (http://www.r-project.org/). For the current study, 
genotype data were generated at 557 124 SNPs across the
genome for 1733 individuals (comprising 913 TOF cases plus
a number of unaffected relatives). We excluded individuals
with genotype call rates <98.78% and average heterozygosities
outside the range (0.310, 0.336) (based on consideration of
540 241 autosomal SNPs passing loose QC, namely successfully
genotyped in >95% of individuals and with a Hardy-Weinberg
equilibrium test P-value > 10~**). These exclusion
thresholds were chosen based on visual inspection of the call
rates and heterozygosities (Supplementary Material, Fig. S5)
to retain the majority of individuals while excluding outlying
individuals.
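The per-individual exclusions can be expressed as a simple filter. The sketch below uses the discovery-cohort thresholds quoted above and assumes that a sample is dropped when it fails either criterion. The record type, field names and sample values are hypothetical and stand in for whatever per-sample summary the QC software produces.

```python
from dataclasses import dataclass

# Thresholds quoted in the text for the discovery cohort.
MIN_CALL_RATE = 0.9878
HET_RANGE = (0.310, 0.336)


@dataclass
class SampleQC:
    sample_id: str
    call_rate: float       # proportion of SNPs successfully genotyped
    heterozygosity: float  # average autosomal heterozygosity


def passes_sample_qc(sample):
    """Keep a sample only if its call rate and heterozygosity are both acceptable."""
    het_low, het_high = HET_RANGE
    return (sample.call_rate >= MIN_CALL_RATE
            and het_low <= sample.heterozygosity <= het_high)


# Hypothetical samples: one clean, one with low call rate, one heterozygosity outlier.
samples = [
    SampleQC("S1", 0.998, 0.322),
    SampleQC("S2", 0.970, 0.320),
    SampleQC("S3", 0.995, 0.345),
]
kept = [s.sample_id for s in samples if passes_sample_qc(s)]
print(kept)
```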

We generated a smaller set of 41 692 autosomal SNPs
(successfully genotyped in >95% of individuals, with a
Hardy-Weinberg equilibrium test P-value >10~^, with minor allele
frequencies >0.4 and pruned to show low levels of LD using
the PLINK command '--indep 50 5 2') that were used to
check relationships/sample duplications and ethnicities.
Genome-wide identity-by-descent (IBD) sharing was calculated
using the '--Z-genome' command in PLINK, and one of
each pair of related individuals (mean proportion of alleles
shared IBD >0.1) was excluded. Multidimensional scaling of
our samples together with 210 unrelated Phase II HapMap
(43) individuals from 4 populations [Utah residents with
ancestry from northern and western Europe (CEU), Japanese in
Tokyo, Japan, Han Chinese in Beijing, China and Yoruba in
Ibadan, Nigeria] (genotyped at the same set of 41 692 autosomal
SNPs) was performed and identified 33 individuals in our
study who did not cluster with the CEU samples, suggesting
non-European ancestry (Supplementary Material, Fig. S6).
These individuals were excluded.

We used the '--check-sex' option in PLINK to check (based
on the average X chromosomal heterozygosity) that the gender 
of our samples matched its expected value and excluded 
samples for which we were unable to resolve inconsistencies. 

Following QC, we were left with 835 unrelated TOF cases 
(plus 717 additional family members), whose genotypes were 
compared with genotype data from 5159 UK population-based 
controls obtained from the WTCCC2 (https://ccc.wtccc.org. 
uk/ccc2). These controls comprised 2673 samples from the 
1958 British Birth Cohort study (58C) and 2486 National 
Blood Service (NBS) samples (selected from an initially gen- 
otyped set of 2930 58C samples and 2737 NBS samples). We 
excluded the same controls as had been excluded in the
WTCCC2 (44) and WTCCC3 (45) studies, plus an additional
four controls that we found to be outliers, following a principal
components analysis using the 'smartpca' routine of the
EIGENSOFT package (37).

Within each of the case and control cohorts, we excluded
any SNPs with minor allele frequencies <0.01, that were
successfully genotyped in <95% of individuals or that had a
Hardy-Weinberg equilibrium test P-value < 10~^. Within
the two control cohorts, we also implemented SNP exclusions
recommended by WTCCC2 relating to a measure of the
statistical information in the genotype data about allele frequency
(exclude if <0.975), missingness (exclude if >2% missing
genotypes) and plate effects (exclude if P-value from an
n-degree of freedom test of plate association <1 x 10~^).
Within the TOF case cohort (for which a number of family
members had been genotyped), we also excluded SNPs
showing >10% Mendelian inheritance errors.
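A Hardy-Weinberg filter of the kind described compares observed genotype counts with those expected from the estimated allele frequency. The sketch below uses the 1-degree-of-freedom chi-square approximation for simplicity. This is a swapped-in technique for illustration, not necessarily the test PLINK applies (PLINK also provides an exact test). The genotype counts and function name are hypothetical.

```python
import math


def hwe_chi2_test(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for Hardy-Weinberg equilibrium (1 df)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2.0 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        # Monomorphic SNP: no departure from equilibrium can be assessed.
        return 0.0, 1.0
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: one SNP in equilibrium, one with no heterozygotes at all.
print(hwe_chi2_test(25, 50, 25))
print(hwe_chi2_test(50, 0, 50))
```

A SNP would then be excluded when its P-value falls below the chosen threshold, alongside the minor allele frequency and call-rate criteria.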

Following an initial association analysis, visual inspection 
of intensity cluster plots (46) was performed for all SNPs 
showing nominal P-value < 10~^, and only those SNPs for 
which the genotype calls appeared reliable (well clustered 
into three distinct groups) were taken forward for replication. 
All SNPs reported here passed this visual inspection of 
intensity cluster plots in the discovery cohort, indicating that 
genotype data and, thus, results at these SNPs could be consid- 
ered reliable. 



Replication cohort 

QC was also performed on the genotype data from the 
TwinsUK replication sample. From the 2603 first twins con- 
sidered, we excluded 43 showing genotype call rates <99% 
and average heterozygosities outside the range (0.312, 
0.331) (based on the consideration of 576 610 autosomal 
SNPs passing loose QC, namely: successfully genotyped in 
>95% of individuals and with a Hardy-Weinberg equilibrium
test P-value >10~'*). These exclusion thresholds were chosen
based on visual inspection of the call rates and heterozygos- 
ities. We carried out testing of relationships/sample duplica- 
tions and ethnicity using the same approach as described 
above for the TOF cohort and excluded first twins who did 
not cluster with the CEU HapMap samples and one of each 
pair of first twins who showed high IBD sharing (mean pro- 
portion of alleles IBD >0.05). This resulted in a final set of 
2547 TwinsUK controls to be used in the replication study. 



Post-imputation QC 

We used the program IMPUTE version 2 (42) to carry out im- 
putation in the discovery cohort within the 5 Mb region 
centred around rs11065987. Data from the 1000 Genomes
Project (47) (Phase I interim data, released June 2011) were 
used as a reference panel. Post-imputation QC involved ex- 
cluding any SNPs likely to be poorly imputed (specifically 
those with an 'info' score <0.5 or with minor allele frequency 
in controls <0.01). Data at 11 208 SNPs passing post- 
imputation QC (from an original 56 637 imputed SNPs) 
were analysed in the program SNPTEST version 2.1.1 (46), 
using the '-method ml' option (a Newton-Raphson algorithm 
that maximizes the missing data likelihood) to account for 
genotype uncertainty. 
SUPPLEMENTARY MATERIAL 

Supplementary Material is available at HMG online. 